# load packages
suppressPackageStartupMessages({
library(ComplexHeatmap)
library(cowplot)
library(ddSingleCell)
library(dplyr)
library(limma)
library(scater)
})
We provide an examplary SingleCellExperiment (SCE) generated from a subset of data from (Kang et al. 2018). The original dataset contains 10x droplet-based scRNA-seq PBCM data from 8 Lupus patients obtained befor and after 6h-treatment with INF-beta. The complete raw data, as well as gene and cell metadata is available through the NCBI GEO, accession number GSE96583.
For size and runtime reasons, the data has been filtered to
The resulting SCE contains ~700 genes and ~6500 cells. Assay logcounts corresponds to log-normalized values
obtained from normalize (package scater) with default parameters (library size normalization).
# load data
data(sce)
sce
## class: SingleCellExperiment
## dim: 672 6456
## metadata(2): experiment_info log.exprs.offset
## assays(2): counts logcounts
## rownames(672): ENSG00000187608 ENSG00000175756 ... ENSG00000160201
## ENSG00000160213
## rowData names(1): feature
## colnames(6456): CGGGCATGCAGAAA-1 ATACTCTGTGACTG-1 ...
## TTTCAGTGCGAGAG-1 TTTCTACTTCAGAC-1
## colData names(3): sample_id group_id cluster_id
## reducedDimNames(1): tsne
## spikeNames(0):
As we will be aggregating measurements at the cluster-sample level, it is of particular importance to check the number of cells captured for each such instance. While aggregateData (see Section 3) allows excluding cluster-sample combinations with less than a threshold number of cells, clusters or samples with overall very low cell-counts may be excluded from further analysis at this point already.
For the Kang dataset, for example, we might consider removing the Dendritic cells and Megakaryocytes clusters, as these containg less than 50 cells across all samples.
# nb. of cells per cluster-sample
table(sce$cluster_id, sce$sample_id)
##
## ctrl1015 ctrl1256 ctrl1488 stim1015 stim1256 stim1488
## B cells 200 200 200 200 200 200
## CD14+ Monocytes 200 200 200 200 200 200
## CD4 T cells 200 200 200 200 200 200
## CD8 T cells 179 131 58 133 112 43
## Dendritic cells 30 26 43 30 32 54
## FCGR3A+ Monocytes 200 69 101 200 97 144
## Megakaryocytes 20 14 18 20 17 29
## NK cells 177 200 116 195 200 168
The dimensions reductions (DR) available within the SCE can be viewed via reducedDims from the scater package,
and visualized using plotReducedDim. For our dataset, the t-SNE colored by cluster_ids (Fig. 1A) shows that cell-populations are well-separated from one another. INF-beta stimulation manifests as a severe shift in the t-SNE projection of cells when coloring by group_ids (Fig. 1B), indicating widespread, genome-scale transcriptiontal changes.
# t-SNE colored by cluster_id & group_id
ps_tsne <- lapply(c("cluster_id", "group_id"), function(x)
plotReducedDim(sce, use_dimred = "tsne", colour_by = x,
add_ticks = FALSE) + theme(aspect.ratio = 1))
plot_grid(plotlist = ps_tsne, align = "vh", labels = c("A", "B"))
Figure 1: t-SNE
Cells are colored by cluster ID (A) and group ID (B), respectively.
aggregateData() will aggregate single-cell measurements by the colData variables specified with argument by, and return a SingleCellExperiment containing pseudo-bulk data. For DE analysis at the cluster-level, measurements must be aggregated at the cluster-sample level (default by = c("cluster_id", "sample_id"). In this case, the returned SingleCellExperiment will contain one assay per cluster, each with dimensions number of genes times number of samples. Argument assay and fun specify the input data and summary statistic, respectively, to use for aggregation.
pb <- aggregateData(
x = sce, assay = "counts", fun = "sum",
by = c("cluster_id", "sample_id"))
# one sheet per cluster
assayNames(pb)
## [1] "B cells" "CD14+ Monocytes" "CD4 T cells"
## [4] "CD8 T cells" "Dendritic cells" "FCGR3A+ Monocytes"
## [7] "Megakaryocytes" "NK cells"
# PBs for 1st cluster
head(assay(pb))
## ctrl1015 ctrl1256 ctrl1488 stim1015 stim1256 stim1488
## ENSG00000187608 60 54 36 2422 2209 2312
## ENSG00000175756 58 61 68 52 52 58
## ENSG00000157916 34 44 38 46 39 45
## ENSG00000116251 111 89 105 68 61 72
## ENSG00000116288 138 169 189 117 148 139
## ENSG00000074800 210 159 201 134 163 176
Prior to conducting any formal testing, we can compute a multi-dimensional scaling (MDS) plot of aggregated signal to explore overall sample similarities.
pbMDS takes as input any SCE containg PB data as returned by aggregateData, and computes MDS dimensions using edgeR. Ideally, such a represenation of the data should separate both clusters and groups from one-another, while sample from the same cluster and groups should be close to one another.
In our MDS plot on pseudo-bulk counts (Fig. 2), we can observe that the first dimension (MDS1) clearly separates cell-populations (clusters), while the second (MDS2) separates control and stimulated samples (groups). Furthermore, the two T-cell clusters fall close to each other.
pbMDS(pb)
Figure 2: Pseudo-bulk level MDS plot
Points represent cluster-sample instances, are colored by cluster ID, and shaped by group IDs.
Once we have assembled the pseudo-bulk data, we can test for cluster-level DE using runDS. We specify a design matrix capturing the experimental design using model.matrix (package stats), and a contrast matrrix that specifies our comparison of interesting using makeContrasts from the limma package. Alternatively, the comparison of interest (or a list thereof) can be specified with via coefs (see ?glmQLFTest for details).
For the Kang dataset, we want to carry out a single comparison of stimulated against control samples, thus placing "ctrl" on the right-hand side as the reference condition.
# construct design & contrast matrix
ei <- metadata(sce)$experiment_info
design <- model.matrix(~ 0 + ei$group_id)
dimnames(design) <- list(ei$sample_id, levels(ei$group_id))
contrast <- makeContrasts("stim-ctrl", levels = design)
# run DE analysis
res <- runDS(sce, pb, design, contrast, method = "edgeR", verbose = FALSE)
# access results
tbl <- res$table$`stim-ctrl`
# one data.frame per cluster
names(tbl)
## [1] "B cells" "CD14+ Monocytes" "CD4 T cells"
## [4] "CD8 T cells" "Dendritic cells" "FCGR3A+ Monocytes"
## [7] "Megakaryocytes" "NK cells"
# view results for 1st cluster
head(tbl[[1]])
## gene cluster_id logFC logCPM F p_val
## 1 ENSG00000187608 B cells 5.5440098 12.669406 1231.9513592 1.991223e-13
## 2 ENSG00000175756 B cells -0.1918794 8.359226 1.8099927 2.122401e-01
## 3 ENSG00000157916 B cells 0.1785239 7.876081 0.8685524 3.697983e-01
## 4 ENSG00000116251 B cells -0.5865751 8.880400 15.8540164 1.834226e-03
## 5 ENSG00000116288 B cells -0.2784390 9.695360 3.6882618 7.896894e-02
## 6 ENSG00000074800 B cells -0.2571509 9.905444 3.4526918 8.793931e-02
## p_adj contrast
## 1 4.460338e-11 stim-ctrl
## 2 3.547894e-01 stim-ctrl
## 3 5.271614e-01 stim-ctrl
## 4 7.250587e-03 stim-ctrl
## 5 1.676411e-01 stim-ctrl
## 6 1.840972e-01 stim-ctrl
To get a general overview of the differential testing results, we first filter them to retain hits FDR < 5% and abs(logFC) > 1, and count the number and frequency of differential findings by cluster. Finally, we can view the top hits (lowest adj. p-value) in each cluster.
# filter FDR < 5%, abs(logFC) > 1
tbl_fil <- lapply(tbl, function(u)
dplyr::filter(u, p_adj < 0.05, abs(logFC) > 1))
# nb. of DE genes & % of total by cluster
n_de <- vapply(tbl_fil, nrow, numeric(1))
p_de <- format(n_de / nrow(sce) * 100, digits = 3)
data.frame("#DE" = n_de, "%DE" = p_de, check.names = FALSE)
## #DE %DE
## B cells 110 16.37
## CD14+ Monocytes 189 28.12
## CD4 T cells 78 11.61
## CD8 T cells 77 11.46
## Dendritic cells 138 20.54
## FCGR3A+ Monocytes 137 20.39
## Megakaryocytes 40 5.95
## NK cells 92 13.69
# view top 2 hits in each cluster
do.call("rbind", lapply(tbl_fil, function(u)
dplyr::arrange(u, p_adj)[1:2, ]))
## gene cluster_id logFC logCPM
## B cells.1 ENSG00000172183 B cells 3.239742 12.674136
## B cells.2 ENSG00000187608 B cells 5.544010 12.669406
## CD14+ Monocytes.1 ENSG00000172183 CD14+ Monocytes 6.058700 12.485934
## CD14+ Monocytes.2 ENSG00000102524 CD14+ Monocytes 6.012809 10.101459
## CD4 T cells.1 ENSG00000121858 CD4 T cells 3.934817 9.338110
## CD4 T cells.2 ENSG00000177409 CD4 T cells 4.141839 7.913198
## CD8 T cells.1 ENSG00000126709 CD8 T cells 5.345602 11.464966
## CD8 T cells.2 ENSG00000187608 CD8 T cells 4.691455 12.614579
## Dendritic cells.1 ENSG00000187608 Dendritic cells 6.108153 13.992774
## Dendritic cells.2 ENSG00000169245 Dendritic cells 10.205696 13.247729
## FCGR3A+ Monocytes.1 ENSG00000121858 FCGR3A+ Monocytes 4.024035 11.962174
## FCGR3A+ Monocytes.2 ENSG00000204389 FCGR3A+ Monocytes 3.089431 9.834430
## Megakaryocytes.1 ENSG00000172183 Megakaryocytes 3.477222 11.970324
## Megakaryocytes.2 ENSG00000119917 Megakaryocytes 6.888600 10.679613
## NK cells.1 ENSG00000187608 NK cells 4.349408 13.094877
## NK cells.2 ENSG00000172183 NK cells 3.576383 12.541496
## F p_val p_adj contrast
## B cells.1 1545.71123 5.186932e-14 3.485618e-11 stim-ctrl
## B cells.2 1231.95136 1.991223e-13 4.460338e-11 stim-ctrl
## CD14+ Monocytes.1 1908.74907 5.951589e-11 3.999468e-08 stim-ctrl
## CD14+ Monocytes.2 1161.14911 4.460946e-10 1.498878e-07 stim-ctrl
## CD4 T cells.1 393.59382 3.553975e-09 2.388271e-06 stim-ctrl
## CD4 T cells.2 253.57367 2.826916e-08 2.599262e-06 stim-ctrl
## CD8 T cells.1 782.72158 4.715279e-12 3.168667e-09 stim-ctrl
## CD8 T cells.2 656.94645 1.289623e-11 4.333132e-09 stim-ctrl
## Dendritic cells.1 898.82092 3.146701e-11 1.969364e-08 stim-ctrl
## Dendritic cells.2 793.89006 5.861201e-11 1.969364e-08 stim-ctrl
## FCGR3A+ Monocytes.1 683.86450 2.831331e-09 1.902654e-06 stim-ctrl
## FCGR3A+ Monocytes.2 541.77665 7.363511e-09 2.474140e-06 stim-ctrl
## Megakaryocytes.1 103.29120 1.677068e-06 1.126990e-03 stim-ctrl
## Megakaryocytes.2 72.71286 7.917890e-06 2.660411e-03 stim-ctrl
## NK cells.1 817.05756 1.423041e-10 9.562837e-08 stim-ctrl
## NK cells.2 580.81496 7.111416e-10 2.389436e-07 stim-ctrl
In is often worthwhile to filter DE results based on the expression frequencies of each gene, that is, the fraction of cells that express it. calcExprFreqs provides a flexible way of computing cluster-sample/-group wise expression frequencies. Here, a gene is considered to be expressed when the specified measurement value (argument assay) falls above a certain threshold (argument th). Note that, assay = "counts" and th = 0 (default) amounts to the fraction of cells for which a respective gene has been detected.
calcExprFreqs will return a SummarizedExperiment object, where sheets (assays) = clusters, rows = genes, and columns = samples (and groups, if group_ids are present in the colData of the input SCE).
frq <- calcExprFreqs(sce, assay = "counts", th = 0)
# one sheet per cluster
assayNames(frq)
## [1] "B cells" "CD14+ Monocytes" "CD4 T cells"
## [4] "CD8 T cells" "Dendritic cells" "FCGR3A+ Monocytes"
## [7] "Megakaryocytes" "NK cells"
# expr. freqs. for 1st cluster
head(assay(frq))
## ctrl1015 ctrl1256 ctrl1488 stim1015 stim1256 stim1488
## ENSG00000187608 0.180 0.185 0.135 1.000 0.995 1.000
## ENSG00000175756 0.245 0.230 0.225 0.240 0.225 0.240
## ENSG00000157916 0.150 0.180 0.170 0.190 0.170 0.205
## ENSG00000116251 0.420 0.325 0.365 0.255 0.275 0.300
## ENSG00000116288 0.405 0.490 0.470 0.380 0.460 0.385
## ENSG00000074800 0.480 0.355 0.435 0.325 0.400 0.365
## ctrl stim
## ENSG00000187608 0.1666667 0.9983333
## ENSG00000175756 0.2333333 0.2350000
## ENSG00000157916 0.1666667 0.1883333
## ENSG00000116251 0.3700000 0.2766667
## ENSG00000116288 0.4550000 0.4083333
## ENSG00000074800 0.4233333 0.3633333
Especially when testing multiple contrasts or coefficients, the results returned by runDS may become very complex and unhandy for exploration or exporting. Results can be formatted using resDS, which provides two alternative modes for formatting: bind = "row"/"col".
When bind = "row", results from all comparisons will be merged vertically (analogouse to do.call("rbind", ...)) into a tidy format table, with column contrast/coef specifying the comparison.
Otherwise, bind = "col", results will be merge horizontally into a single wide table where all results for a given gene and cluster are kept in one row. An identifier of the respective contrast of coefficient is then appended to the column names. This format is useful when wanting to view a specific gene’s behavior across, for example, multiple treatments, but will become messy when many comparisons are included.
Expression frequencies computed with calcExprFreqs, as well as cluster-sample level avg. CPM, can be included in the results by setting frq/cpm = TRUE. Alternatively, if the former have been pre-computed, they can be supplied directly as an input to resDS (see example below).
tbl_big <- resDS(sce, res, bind = "row", frq = frq, cpm = FALSE)
tbl_tdy <- resDS(sce, res, bind = "col", frq = frq, cpm = FALSE)
We first generate a set of t-SNEs colored by gene expression for the top DE gene in each cluster. To facilitate matching the affected cells to their cluster and experimental group, we also add the t-SNEs colored by cluster and group ID, respectively, that are plotted above (Fig. 1).
# get cluster IDs
cluster_ids <- levels(sce$cluster_id)
names(cluster_ids) <- cluster_ids
ps <- lapply(cluster_ids, function(k) {
u <- dplyr::arrange(tbl[[k]], p_adj) # sort by adj. p-value
g <- u$gene[1] # get top hit
plotReducedDim(sce, "tsne", colour_by = g, add_ticks = FALSE) +
ggtitle(sprintf("%s(%s)", g, k)) + theme_void() +
theme(aspect.ratio = 1, legend.position = "none",
plot.title = element_text(size = 8))
})
row1 <- plot_grid(plotlist = ps_tsne, ncol = 2, align = "vh")
row2 <- plot_grid(plotlist = ps, ncol = 4, align = "vh")
plot_grid(row1, row2, ncol = 1, rel_heights = c(2, 3))
For changes of high interest, we can view the cell-level expression profiles of a specific gene across samples or groups using plotExpression (package scater).
Here, we generate violins plot for the top DE genes (lowest adj. p-value) in the first three clusters (Fig. 3). Note that, as we are testing for DE on the cluster-level, we need to subset the cells that have been assigned to a given cluster for plotting.
# generate violins for top hits by cluster
ps <- lapply(cluster_ids[seq_len(3)], function(k) {
u <- dplyr::arrange(tbl[[k]], p_adj) # sort by adj. p-value
gs <- u$gene[seq_len(5)] # get top hits
plotExpression(sce[, sce$cluster_id == k], # subset this cluster
features = gs, x = "sample_id", colour_by = "group_id", ncol = 5) +
ggtitle(k) + theme(axis.text.x = element_text(angle = 30, hjust = 1))
})
plot_grid(plotlist = ps, ncol = 1)
Figure 3: Violin plots
Show are the top 5 hits (lowest adj. p-value) for the 1st 3 clustes.
Especially when wanting to gain an overview of numerous DE testing results for many clusters, bothm dimension reduction and cell-level visualisations require a lot of space can become cumbersome to interpret. In this setting, it is thus recommendable to visualise aggregated measures, e.g., mean expressions by cluster sample. We can use aggregateData to assemble cluster-sample level mean expression values for all genes, and visualize any hits of interest.
# compute cluster-sample expression means
ms <- aggregateData(sce, assay = "logcounts", fun = "mean")
ms <- data.frame(
row.names = NULL, gene = rownames(ms),
cluster_id = rep(assayNames(ms), each = nrow(sce)),
do.call("rbind", as.list(assays(ms))))
head(ms)
## gene cluster_id ctrl1015 ctrl1256 ctrl1488 stim1015
## 1 ENSG00000187608 B cells 0.2705948 0.2658873 0.1619929 3.7980880
## 2 ENSG00000175756 B cells 0.3039544 0.2996701 0.2871825 0.2800285
## 3 ENSG00000157916 B cells 0.1686634 0.2385715 0.2061330 0.2350590
## 4 ENSG00000116251 B cells 0.5670806 0.4716987 0.5039134 0.3482990
## 5 ENSG00000116288 B cells 0.6018059 0.7405015 0.7346284 0.5432610
## 6 ENSG00000074800 B cells 0.7462413 0.5688830 0.6882017 0.5042638
## stim1256 stim1488
## 1 3.6442352 3.6619709
## 2 0.2806413 0.2786930
## 3 0.1996890 0.2322464
## 4 0.3455977 0.3706336
## 5 0.6794385 0.5502087
## 6 0.6264986 0.5530005
n <- 3 # nb. of genes to plot per cluster
top <- lapply(cluster_ids, function(k) {
u <- dplyr::arrange(tbl[[k]], p_adj)
u <- u[seq_len(n), ]
ms %>% dplyr::filter(gene %in% u$gene & cluster_id %in% u$cluster_id)
})
# assemble means from all clusters
top <- do.call("rbind", top)
# set rownames & remove un-needed columns
rownames(top) <- with(top, sprintf("%s(%s)", gene, cluster_id))
top <- select(top, -c("gene", "cluster_id"))
# plot heatmap of cluster-sample expression means
Heatmap(top,
cluster_rows = FALSE,
cluster_columns = FALSE)
plotDiffGenes
# plot top results for a single cluster
plotDiffGenes(sce, res, k = "B cells")
# plot single gene across all clusters
plotDiffGenes(sce, res, g = rownames(sce)[1])
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS 10.14.3
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 grid stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] scater_1.10.1 SingleCellExperiment_1.4.1
## [3] SummarizedExperiment_1.12.0 DelayedArray_0.8.0
## [5] BiocParallel_1.16.6 matrixStats_0.54.0
## [7] Biobase_2.42.0 GenomicRanges_1.34.0
## [9] GenomeInfoDb_1.18.2 IRanges_2.16.0
## [11] S4Vectors_0.20.1 BiocGenerics_0.28.0
## [13] limma_3.38.3 dplyr_0.8.0.1
## [15] ddSingleCell_0.99.7 cowplot_0.9.4
## [17] ggplot2_3.1.0 ComplexHeatmap_1.20.0
## [19] BiocStyle_2.10.0
##
## loaded via a namespace (and not attached):
## [1] viridis_0.5.1 edgeR_3.24.3
## [3] splines_3.5.1 viridisLite_0.3.0
## [5] DelayedMatrixStats_1.4.0 assertthat_0.2.0
## [7] highr_0.7 BiocManager_1.30.4
## [9] vipor_0.4.5 GenomeInfoDbData_1.2.0
## [11] yaml_2.2.0 pillar_1.3.1
## [13] lattice_0.20-38 glue_1.3.0
## [15] digest_0.6.18 RColorBrewer_1.1-2
## [17] XVector_0.22.0 colorspace_1.4-0
## [19] htmltools_0.3.6 Matrix_1.2-15
## [21] plyr_1.8.4 pkgconfig_2.0.2
## [23] GetoptLong_0.1.7 bookdown_0.9
## [25] zlibbioc_1.28.0 purrr_0.3.0
## [27] scales_1.0.0 HDF5Array_1.10.1
## [29] tibble_2.0.1 withr_2.1.2
## [31] lazyeval_0.2.1 magrittr_1.5
## [33] crayon_1.3.4 evaluate_0.13
## [35] MAST_1.8.2 beeswarm_0.2.3
## [37] tools_3.5.1 data.table_1.12.0
## [39] GlobalOptions_0.1.0 stringr_1.4.0
## [41] Rhdf5lib_1.4.2 munsell_0.5.0
## [43] locfit_1.5-9.1 compiler_3.5.1
## [45] rlang_0.3.1 rhdf5_2.26.2
## [47] RCurl_1.95-4.11 rjson_0.2.20
## [49] circlize_0.4.5 labeling_0.3
## [51] bitops_1.0-6 rmarkdown_1.11
## [53] gtable_0.2.0 abind_1.4-5
## [55] reshape2_1.4.3 R6_2.4.0
## [57] gridExtra_2.3 knitr_1.21
## [59] shape_1.4.4 ggbeeswarm_0.6.0
## [61] stringi_1.3.1 Rcpp_1.0.0
## [63] tidyselect_0.2.5 xfun_0.4
Kang, Hyun Min, Meena Subramaniam, Sasha Targ, Michelle Nguyen, Lenka Maliskova, Elizabeth McCarthy, Eunice Wan, et al. 2018. “Multiplexed Droplet Single-Cell Rna-Sequencing Using Natural Genetic Variation.” Nat Biotechnol 36 (1): 89–94. https://doi.org/10.1038/nbt.4042.